I’m looking at games sales data - This is hopefully going to look like a combination of total sales stats, trends by genre (if there is that much detail), reviews(?)
Haven’t fully decided on the question I want to tackle, but I think I’m approaching the scenario in a similar way to the brief
“Small games company want to understand what types of games sell a lot of copies. In particular they are looking for analysis that helps them decide which direction to take their company in.”
I’m not very happy with the dataset, as the landscape in 2019/16 was vastly different to 2023.
Games as a service as a model still isn’t dead, so the traditional method of looking at Sales to judge success might not be enough
Subscription services - chiefly Game Pass, but nowadays you don’t really need to buy a game you’re interested in - Similar model to netflix
DLC and post-launch content - You aren’t finished buying a game when you pay for it at launch. Paid content is drip fed for up to several years after a games launch
Missing sales numbers - As far as I’m aware, a developer/publisher doesn’t have to release sales numbers.
Missing titles - It might be hard to believe, but there are many more games out there than FIFA, Skyrim and Mario Kart. Smaller or more niche titles may not be represented even if they are a better model for a smaller company to emulate
Steam alone contains over 50 thousand games (granted, of extremely varying quality)
Measures of success - It simply isn’t fair to judge success between some titles. Comparing Lethal League Blaze to Super Smash Brothers Ultimate is not a level playing field.
Before I even look at the data, I’m expecting some bias - Newer consoles might not be represented in data, which might “suggest” that it’s better to develop a title for an older system, as they have “more success”.
Some titles will blow others out the water in terms of sales, simply due to them being multi-platform. For example, a successful PS4 exclusive title will likely not sell as many copies as a successful multi-platform title
One of the biggest influences on sales will be Title/Franchise - If a game is called “Pokemon”, chances are it’ll do well. This is not something that a small company will be able to emulate, unless they are looking to get sued.
Currently i have a few datasets that might be useful for supplementing the 2019 data from one of the example briefs
Hypothesis test - Franchise v non-Franchise
library(tidyverse)
library(tidytext)
library(ggwordcloud)
library(janitor)
steam <- read_csv("raw_data/steam.csv") # absolutely massive
Rows: 27075 Columns: 18── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (8): name, developer, publisher, platforms, categories, genres, steamspy_tags, owners
dbl (9): appid, english, required_age, achievements, positive_ratings, negative_ratings, average_playtime, median_playtime, price
date (1): release_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
games_1 <- read_csv("raw_data/Video_Games.csv")
Rows: 16719 Columns: 16── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Name, Platform, Year_of_Release, Genre, Publisher, User_Score, Developer, Rating
dbl (8): NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales, Critic_Score, Critic_Count, User_Count
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
games_2 <- read_csv("raw_data/Video_Games_Sales.csv")
Rows: 16719 Columns: 16── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Name, Platform, Year_of_Release, Genre, Publisher, User_Score, Developer, Rating
dbl (8): NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales, Critic_Score, Critic_Count, User_Count
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
games_raw <- read_csv("raw_data/Raw Data GVGS&R.csv")
Rows: 16719 Columns: 16── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Name, Platform, Year_of_Release, Genre, Publisher, User_Score, Developer, Rating
dbl (8): NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales, Critic_Score, Critic_Count, User_Count
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
games_cleaned <- read_csv("raw_data/Cleaned Data 2 GVGS&R.csv")
Rows: 6894 Columns: 15── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Name, Genre, Publisher, Developer, Rating
dbl (10): Year_of_Release, NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales, Critic_Score, Critic_Count, User_Score, User_Count
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
games_raw %>%
arrange(Name) %>%
drop_na(Global_Sales) # No NA's in global sales - good start
steam_tags <- steam %>%
select(steamspy_tags) %>%
separate(col = steamspy_tags, sep = ";", into = c("tags", "tags_2", "tags_3")) %>%
pivot_longer(col = starts_with("tags"), values_to = "all_tags") %>%
select(all_tags) %>%
drop_na() %>%
count(all_tags)
Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2665 rows [83, 95, 96, 98, 102, 103, 106, 119, 123, 124, 125, 161, 211, 212, 213, 224, 249, 250, 261, 262, ...].
ggwordcloud(words = steam_tags$all_tags, freq = steam_tags$n, random.color = TRUE, colors = c("#ff595e", "#ffca3a", "#8ac926", "#1982c4", "#6a4c93"))
steam %>%
arrange(desc(name))
games_backloggd <- read_csv("raw_data/games backloggd.csv")
New names:Rows: 1512 Columns: 14── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (12): Title, Release Date, Team, Times Listed, Number of Reviews, Genres, Summary, Reviews, Plays, Playing, Backlogs, Wishlist
dbl (2): ...1, Rating
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
games_backloggd <- games_backloggd %>%
clean_names()
# dropping unreleased games for now - only 3 of them, so won't impact anything too much
# converting release_date into a date and adding a years since release column - This could be usable to discount certain titles from a model
# e.g if there WAS a trend in games 10~ years ago, it may not be applicable today, and might produce incorrect guidance
# splitting genres into separate columns : only one game has 7 genre tags.
games_backloggd <- games_backloggd %>%
arrange(desc(rating)) %>%
filter(release_date != "releases on TBD") %>%
mutate(release_date = mdy(release_date),
time_since_release = as.period(today() - release_date),
years_since_release = as.numeric(time_since_release, "years"), .after = release_date) %>%
mutate(years_since_release = round(years_since_release, digits = 2)) %>%
select(!time_since_release) %>%
separate(col = genres, sep = "', '", into = c("genre_tag", "genre_tag_2", "genre_tag_3", "genre_tag_4", "genre_tag_5", "genre_tag_6", "genre_tag_7"))
Warning: Expected 7 pieces. Missing pieces filled with `NA` in 1508 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
# removing the opening and closing brackets and apostrophes across the genre columns
games_backloggd <- games_backloggd %>%
mutate(genre_tag = str_replace_all(genre_tag, pattern = "\\['", replacement = ""),
across(starts_with("genre_tag"), ~ str_replace_all(., pattern = "'\\]", replacement = "")))
games_backloggd
NA
# Converted the below code into a function so i can reuse it
character_k_to_numeric <- function(x) {
x = enquo(x)
games_backloggd %>%
mutate(multiplier = case_when(
str_detect(!!x, pattern = "K$") ~ TRUE,
TRUE ~ FALSE)) %>%
mutate(!!x := str_remove(string = !!x, pattern = "K$")) %>%
mutate(!!x := as.numeric(!!x)) %>%
mutate(!!x := case_when(
multiplier == TRUE ~ !!x*1000,
TRUE ~ !!x
)) %>%
select(!multiplier)
}
# converts these columns into numerics
games_backloggd <- character_k_to_numeric(x = number_of_reviews)
games_backloggd <- character_k_to_numeric(x = plays)
games_backloggd <- character_k_to_numeric(x = wishlist)
games_backloggd <- games_backloggd %>%
select(!plays, !backlogs, !times_listed)
chunk below is original attempts at converting the above columns
# games_backloggd <- games_backloggd %>%
# select(!times_listed)
# mutating number of reviews to be numeric
# In this horrible step, I'm using str_detect and case_when to create a column called multiplier - this will be used later
# then i drop the "K"'s, and convert the column to a numeric
# then multiply the columns values by 1000 IF the multiplier column is TRUE
# games_backloggd %>%
# mutate(multiplier = case_when(
# str_detect(number_of_reviews, pattern = "K$") ~ TRUE,
# TRUE ~ FALSE
# ),.after = number_of_reviews) %>%
# mutate(number_of_reviews = str_remove(number_of_reviews, pattern = "K$")) %>%
# mutate(number_of_reviews = as.numeric(number_of_reviews)) %>%
# mutate(number_of_reviews = case_when(
# multiplier == TRUE ~ number_of_reviews*1000,
# TRUE ~ number_of_reviews
# )) %>%
# select(!multiplier)
#
# # Oh No i need to do this to more columns
#
# games_backloggd %>%
# mutate(multiplier = case_when(
# str_detect(plays, pattern = "K$") ~ TRUE,
# TRUE ~ FALSE
# ),.after = plays) %>%
# mutate(plays = str_remove(plays, pattern = "K$")) %>%
# mutate(plays = as.numeric(plays)) %>%
# mutate(plays = case_when(
# multiplier == TRUE ~ plays*1000,
# TRUE ~ plays
# )) %>%
# select(!multiplier)
# making it a function because that's easier
# character_k_to_numeric <- function(dataframe, column) {
# mutate(multiplier = case_when(
# str_detect(column, pattern = "K$") ~ TRUE,
# TRUE ~ FALSE
# ),.after = column) %>%
# mutate(column = str_remove(column, pattern = "K$")) %>%
# mutate(column = as.numeric(column)) %>%
# mutate(column = case_when(
# multiplier == TRUE ~ column*1000,
# TRUE ~ column
# )) %>%
# select(!multiplier)
# }
# tidying up team column
games_backloggd <- games_backloggd %>%
mutate(team = str_replace_all(team, pattern = "\\['", replacement = ""),
team = str_replace_all(team, pattern = "'\\]", replacement = ""),
team = str_replace_all(team, pattern = "\\', \\'", replacement = ", "))
# keeping these separate incase i need them
reviews_backloggd <- games_backloggd %>%
select(title, summary, reviews)
# cutting these columns for now - if needed, i'll reattach them
games_backloggd <- games_backloggd %>%
select(!x1)
games_backloggd <- games_backloggd %>%
select(-summary, -times_listed, -reviews, -playing, -backlogs)
# removing duplicated rows
games_backloggd <- games_backloggd %>%
unique()
games_backloggd_clean <- games_backloggd
write_csv(games_backloggd_clean, "clean_data/backloggd_clean.csv")
games_1
games_2
These seem to be the same thing
going with games_1 and renaming it something more sensible
games_sales <- games_1
games_sales <- games_sales %>%
clean_names()
games_sales %>%
filter(name == "Minecraft") %>%
count(sum(global_sales))
games_sales %>%
group_by(name) %>%
Error: Incomplete expression: games_sales %>%
group_by(name) %>%
vg_chartz_feb23 <- read_csv("raw_data/game_statistics_feb_2023.csv")
Rows: 62326 Columns: 16── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (15): title, total_sales, total_shipped, publisher, developer, release_date, platform, japan_sales, na_sales, other_sales, pal_sales, user_score, vgchartz_score, critic_score, last_update
dbl (1): pos
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Platform is a little unhelpful. “All” does not mean all platforms (unless Elden Ring released on Switch, or the NES) “Series” might be useful as a look-up for another table though
steam
# i don't know much about titles that aren't in english - are all the ones in this dataset english?
steam <- steam %>%
filter(english == 1) %>%
select(!english)
# approx 1000 aren't - but who are they?
steam %>%
filter(english == 0) %>%
filter(owners != "0-20000")
Error in `filter()`:
ℹ In argument: `english == 0`.
Caused by error:
! object 'english' not found
Backtrace:
1. steam %>% filter(english == 0) %>% ...
4. dplyr:::filter.data.frame(., english == 0)
5. dplyr:::filter_rows(.data, dots, by)
6. dplyr:::filter_eval(...)
8. mask$eval_all_filter(dots, env_filter)
9. dplyr (local) eval()
# make platforms wider - windows_support = TRUE/FALSE, mac_support = TRUE/FALSE, linux_support = TRUE/FALSE
steam <- steam %>%
separate(col = platforms, sep = ";", into = c("platform_1", "platform_2", "platform_3"))
Warning: Expected 3 pieces. Missing pieces filled with `NA` in 21957 rows [21, 25, 27, 30, 34, 35, 36, 37, 38, 39, 40, 41, 42, 45, 46, 47, 48, 49, 50, 51, ...].
# this is verbose and clunky, but it catches everything so whatever
steam <- steam %>%
mutate(windows_support = case_when(
platform_1 == "windows" ~ TRUE,
platform_2 == "windows" ~ TRUE,
platform_3 == "windows" ~ TRUE,
TRUE ~ FALSE ),.after = publisher) %>%
mutate(mac_support = case_when(
platform_1 == "mac" ~ TRUE,
platform_2 == "mac" ~ TRUE,
platform_3 == "mac" ~ TRUE,
TRUE ~ FALSE), .after = windows_support) %>%
mutate(linux_support = case_when(
platform_1 == "linux" ~ TRUE,
platform_2 == "linux" ~ TRUE,
platform_3 == "linux" ~ TRUE,
TRUE ~ FALSE), .after = mac_support) %>%
select(-platform_1, -platform_2, -platform_3)
# change owners to be more readable
steam <- steam %>%
mutate(owners = case_when(
owners == "0-20000" ~ "below 20k",
owners == "20000-50000" ~ "20k to 50k",
owners == "50000-100000" ~ "50k to 100k",
owners == "100000-200000" ~ "100k to 200k",
owners == "200000-500000" ~ "200k to 500k",
owners == "500000-1000000" ~ "500k to 1M",
owners == "1000000-2000000" ~ "1M to 2M",
owners == "2000000-5000000" ~ "2M to 5M",
owners == "5000000-10000000" ~ "5M to 10M",
owners == "10000000-20000000" ~ "10M to 20M",
owners == "20000000-50000000" ~ "20M to 50M",
owners == "50000000-100000000" ~ "50M to 100M",
owners == "100000000-200000000" ~ "100M to 200M"
))
steam %>%
filter(owners == "10M to 20M")
# ADDING A COLUMN TO FIND THE FREE TO PLAY STUFF
steam <- steam %>%
mutate(free_to_play = case_when(
str_detect(genres, "Free to Play") ~ TRUE,
str_detect(steamspy_tags,"Free to Play") ~ TRUE,
# price == 0.00 ~ TRUE,
TRUE ~ FALSE
),.after = publisher)
steam %>%
filter(free_to_play == TRUE)
# Column to find all the VR stuff
steam <- steam %>%
mutate(virtual_reality_support = case_when(
str_detect(name, "VR") ~ TRUE,
str_detect(steamspy_tags,"VR") ~ TRUE,
TRUE ~ FALSE
),.after = free_to_play)
steam %>%
filter(virtual_reality_support == TRUE)
# Now the same but for multi-player - Local and Online
steam <- steam %>%
mutate(multiplayer = case_when(
str_detect(categories, "Local Multi-Player") & str_detect(categories, "Online Multi-Player") & str_detect(categories, "Co-op") ~ "Local + Online + Co-op",
str_detect(categories, "Local Multi-Player") & str_detect(categories, "Online Multi-Player") & !str_detect(categories, "Co-op") ~ "Local + Online",
str_detect(categories, "Local Multi-Player") & str_detect(categories, "Co-op") & !str_detect(categories, "Online Multi-Player") ~ "Local + Co-op",
str_detect(categories, "Online Multi-Player") & str_detect(categories, "Co-op") & !str_detect(categories, "Local Multi-Player") ~ "Online + Co-op",
str_detect(categories, "Local Multi-Player") & !str_detect(categories, "Online Multi-Player") ~ "Local only",
str_detect(categories, "Online Multi-Player") & !str_detect(categories, "Local Multi-Player") ~ "Online only",
str_detect(categories, "Multi-player") & str_detect(categories, "MMO") | str_detect(steamspy_tags, "MMO") ~ "Online only",
str_detect(categories, "Co-op") ~ "Co-op multiplayer",
str_detect(categories, "Multi-player") & !str_detect(categories, "Online Multi-Player") & !str_detect(categories, "Local Multi-Player") ~ "Multiplayer (Unspecified)",
TRUE ~ "No multiplayer"
),.after = virtual_reality_support)
# I think i could add more levels, but not right now
# remember exactly one comment ago when i said "local and online"
steam_checkpoint <- steam
to ratings (similar to how steam does it) __________
CHECKPOINT Don’t bother running anything above this - anything worth saving has been saved in
write_csv(steam_checkpoint, "raw_data/steam_checkpoint.csv")